MLM Trials: Genomic mixed model

Description

This function was developed to solve a mixel model for multi-environmental trials and/or replicated trials when genomic is available. The model includes a semi-parametric term to account for spatial variation when the field layout of the experiment is know. Covariates (fixed effects), genetics (random effect) and spatial term are fitted all in a single step.

Usage

gmm(y,gen,dta=NULL,it=500,bi=200,th=1,model="BRR",...)

Arguments

Numeric vector of phenotypes of length \(obs\) (missing values are allowed). NA is allowed.

gen

Numeric matrix, with dimension \(n\) x \(p\). Attention: Rows must be named with genotype IDs. Missing values are replaced by the SNP expectation.

dta

Data frame with \(obs\) number of rows containg the genotype ID, spatial information and any other set covariates. Make sure to add a column called "ID" (with capital letters) informing the genotype ID with some match in the object gen. For the spatial adjustment, it is necessary to add three numeric columns (not factors) to dta: "Block", "Row" and "Col" (case sensitive), where Block may refer to a field block or an enviroment without blocks Therefore, make sure that blocks from different environments are named differently. Row and Col provide the coordinates for the identification of neighbor plots.

Integer. Total numeric of MCMC iterations used to fit the model.

Integer. Burn-in of MCMC iterations, i.e., number of iteration to be discarted prior to model convergence.

Integer. Thinning parameter: saves only 1 interation every th. Thinning is used to reduce the auto-correlation between Markov chains.

model

Prediction model: The options are: BRR, BayesA, BLASSO, Average, GBLUP, RKHS and RF.

...

Pass arguments to the function that builds the spatial splines NNsrc: rho and dist. By default, rho=1 and dist=3. To check how it looks like in the field, type NNsrc(rho=1,dist=3). If model="RF", it also passes the setting of random forest: number of trees (NTR=50), sample size (SSZ=80), max nodes (MNO=10), node size (NSZ=4), number of variables (MTR=50), importance sampling (IMP=TRUE) and replacement (RPL=TRUE).

Value

The function gmm returns a list containing the fitted values (hat), observed values (obs), intercept (mu, incidence matrix of genotypes (Z) and the vector of breeding values (EBV). If fixed effects are provided, it also returns the design matrices and coefficients of fixed effects (X,b). If the model was kernel or regression, output will include the random effect coefficients (g), variance components of markers (Vg) and residuals (Ve). Kernel models regress the Eigenvectors of the kernel, weighted by the Eigenvalue. The coefficient (cxx) used in the BRR model to convert marker variance \(Vb\) into genetic variance \(Va\). If spatial information is provided, the output includes the fitted spatial term (sp).

Details

The general model is \(y=Xb+Zu+f(x)+e\), where \(y\) is the response variable, \(Xb\) refers to the fixed effects, \(Zu\) regards the genetic effect, \(f(x)\) represents the field variation, and \(e\) is the vector of residuals. In this model \(u\) is a link function that represents the genetic component, which depends on the model specified.

For whole-genome regression models (BRR, BLASSO, BayesA, Average), \(u = Ma\), where \(M\) is the matrix of genotypes. For kernel models (RKHS and GBLUP), \(u=N(0,K\sigma2a)\), where K is either a Gaussian kernel (RKHS) or a linear kernel (GBLUP). For the non-parametric model (RF), \(u\) is updated every round of MCMC and prediction on unobseved genotyped are performed in every round of MCMC and average out in the end. To avoid over-representation of genotypes, \(u\) is not weighted according to the number of observations of each genotype.

Unobserved genotypes not provided in dta but provided in gen are predicted in the output of the function. Genotypes without genotypic information are transfered to the fixed effect (eg. checks). Missing loci are imputed with the expectation. If dta is not provided, the function will work as a regular genomic prediction model, so the length of y must match the number of rows of gen.

In whole-genome regression models, the regularization of the genetic term is either based on chosen prior (t, Gaussian or Laplace) or through the average of three priors. Gaussian (from ridge regression), t (from BayesA) and Laplace (from Bayesian LASSO). Kernel models (GBLUP and RKHS) are regularized as Gaussian process, which is similar to a ridge regression of Eigenvectors where the regularization of Eigenpairs also relies on the Eigenvalues. Random forest is, at some extend, regularized by multiple parameters including sample size (SSZ), numer of trees (NTR), number of variables (MTR), etc.

If there is a large number of trials and users acknowledge the necessity of sparse matrices, we recommend installing the Matrix package and run the following code that enables sparsity:

source(system.file("add","sparseGMM.R",package="NAM"))

Examples

Run this code

# NOT RUN {
# For a demo with wulti-environmental trial type:
# demo(fittingMET)

# Checking heritability
data(tpod)
fit = gmm(y,gen,model = 'BRR')
fit$Vg*fit$cxx / (fit$Vg*fit$cxx+fit$Ve)

# }

Run the code above in your browser using DataLab